Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Dr. Nisha Auti, Sumit Ranaware, Shreeraj Ghadge, Rajdatta Jadhav, Prajwal Jagtap
DOI Link: https://doi.org/10.22214/ijraset.2023.52206
Hate speech is a crime that has been increasing in recent years, not only in person but also online. There are several causes for this. Social media has grown tremendously, and its anonymity features promote full freedom of expression. Freedom of expression is a human right, but hate speech directed at individuals or groups on the basis of race, caste, religion, ethnicity or nationality, gender, disability, gender identity, etc. is an abuse of that right. It promotes violence and hate crimes, creates social imbalances, and undermines peace, trust and human rights. Detecting hate speech in social media discourse is a very important but complex task. On the one hand, the anonymity provided by the Internet, especially social networks, makes people more likely to engage in hostile behaviour. On the other hand, the desire to express one's thoughts on the Internet has increased, leading to the spread of hate speech. Governments and social media platforms can benefit from detection and prevention technologies, as this kind of bigoted language can wreak havoc on society. We help resolve this dilemma by providing a systematic overview of research on this topic in this survey. This project aims to accurately predict various forms of hate speech by addressing different categories of hate individually and examining a set of text mining functions.
Keywords: Hate speech detection
I. INTRODUCTION
Hate speech is a crime that has been on the rise in recent years, not just in face-to-face contacts but also online. Social media is exploding in popularity, and its anonymity aspect fully fosters freedom of expression.
Hate speech directed at an individual or group based on race, caste, religion, ethnic or national origin, sex, handicap, gender identity, or other factors is an abuse of this freedom.
It actively promotes violence or hate crimes and disrupts society by jeopardizing peace, credibility, and human rights, among other things. Detecting hate speech in social media discourse is crucial, but it's a difficult undertaking.
This study first addresses the quality of datasets, a major concern underlying many of the problems that have been identified.
This paper also addresses the second issue, which is that the best characteristics for hate speech identification must be investigated and determined before developing a suitable classifier. For this reason, datasets tend to fall into one of these categories.
The work is divided into two parts: Hate speech tweets are categorized into five types.
II. LITERATURE SURVEY
III. SYSTEM ARCHITECTURE
IV. MODULE EXPLANATION
A. Dataset Preprocessing
Dataset preprocessing refers to the steps taken to prepare raw data for analysis or modeling. It involves transforming the data into a format that can be easily understood and used by machine learning algorithms. The goal of dataset preprocessing is to clean and transform raw data so that it is more accurate, consistent, and usable.
The preprocessing steps can include:
These preprocessing steps are essential for ensuring that the data used for analysis or modeling is accurate, consistent, and reliable. By preparing the data properly, we can improve the performance of machine learning algorithms, reduce errors, and generate more accurate results.
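The kind of cleaning described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the specific steps (lowercasing, removing URLs and user mentions, dropping punctuation, collapsing whitespace) and the function name `clean_text` are assumptions chosen because they are common for social media text.

```python
import re

# Patterns for a hypothetical cleaning pipeline (assumed steps, not from the paper).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase the text, then strip URLs, mentions, punctuation and extra spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove links
    text = MENTION_RE.sub(" ", text)    # remove @user mentions
    text = NON_ALNUM_RE.sub(" ", text)  # keep only letters, digits and whitespace
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_text("Hello @user, visit https://example.com NOW!!!"))
```

Removing URLs before punctuation matters here: stripping punctuation first would break a link into fragments that the URL pattern no longer matches.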
B. Feature Engineering
During the exploratory data analysis, it is found that many attributes of comments outside of the words themselves may be useful in predicting whether they are toxic. The features added to the dataset are:
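As an illustration of attributes "outside of the words themselves", the sketch below computes a few surface statistics per comment. The feature names (`char_count`, `word_count`, `uppercase_ratio`, `exclamation_count`) are hypothetical examples and are not claimed to be the features used in this work.

```python
def engineer_features(comment: str) -> dict:
    """Compute simple surface features of a comment (hypothetical feature set)."""
    letters = [c for c in comment if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "char_count": len(comment),
        "word_count": len(comment.split()),
        # Share of alphabetic characters that are uppercase; 0.0 if there are none.
        "uppercase_ratio": upper / len(letters) if letters else 0.0,
        "exclamation_count": comment.count("!"),
    }


print(engineer_features("STOP it!!"))
```

Features like these are computed on the raw comment, before lowercasing or punctuation removal, since cleaning would erase exactly the signals they measure.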
C. Feature Extraction
Feature extraction for hate speech classification involves identifying and extracting relevant features from text data that can be used to distinguish between hate speech and non-hate speech. Here is a general approach to feature extraction for hate speech classification:
Tokenization: The next step is to break down the text into individual tokens.
Feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). This new, reduced set of features should then be able to summarize most of the information contained in the original set of features. In this way, a summarized version of the original features can be created from a combination of the original set.
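Tokenization can be sketched in a few lines. The example below shows word-level tokens and character n-grams (the latter are mentioned in the Vectorization subsection); the regular expression and the function names `word_tokens` and `char_ngrams` are assumptions made for illustration.

```python
import re


def word_tokens(text: str) -> list:
    """Split lowercased text into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_ngrams(text: str, n: int = 3) -> list:
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(word_tokens("You're NOT welcome"))
print(char_ngrams("hate", 3))
```

Character n-grams can be more robust than word tokens to deliberate misspellings, at the cost of a larger feature space.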
D. Classification
The purpose of classification in machine learning (ML) is to develop algorithms that can automatically assign predefined categories or labels to new, unseen data based on patterns and relationships learned from labeled training data. The goal is to build a predictive model that can accurately classify or categorize data instances into distinct classes.
Classification is defined as the process of recognition, understanding, and grouping of objects and ideas into pre-set categories, also known as “sub-populations.” With the help of these pre-categorized training datasets, classification in machine learning programs leverages a wide range of algorithms to classify future datasets into respective and relevant categories.
E. Vectorization
In this work, a term frequency-inverse document frequency (tf-idf) statistic is used to vectorize text. The number of features and the presence of character n-grams are parameters to tune for model optimization.
Vectorization in the context of hate speech detection refers to the process of representing text data in a numerical format that machine learning algorithms can understand and process. It involves converting textual information into numerical vectors that capture the semantic and syntactic properties of the text. Here's an explanation of vectorization for hate speech detection:
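A minimal sketch of tf-idf vectorization follows. It assumes the plain textbook definition, tf(t, d) × log(N / df(t)), with whitespace tokenization and no smoothing or normalization; library implementations often differ in these details, so the numbers are illustrative only. The function name `tfidf_vectorize` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectorize(docs):
    """Return (vocabulary, vectors) using unsmoothed tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append(
            [(counts[tok] / total) * idf[tok] if total else 0.0 for tok in vocab]
        )
    return vocab, vectors


vocab, vectors = tfidf_vectorize(["you are kind", "you are awful"])
print(vocab)
print(vectors)
```

Tokens that occur in every document, such as "you" and "are" here, receive an idf of zero and therefore do not help separate the two comments, while the distinguishing words carry all the weight.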
F. Feature Scaling
The engineered features are normalized from 0.0 to 1.0. The tf-idf features are not scaled. Feature scaling is the process of standardizing or normalizing the numerical features in a dataset to ensure that they have the same scale and range. It is a common pre-processing step in machine learning, including for hate speech detection. Here's an explanation of feature scaling for hate speech detection:
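Scaling a feature to the range 0.0 to 1.0 is commonly done with min-max normalization, x' = (x - min) / (max - min). The sketch below assumes that formula; the source does not state which normalization was used, and the function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([10, 20, 30]))
```

In practice the minimum and maximum would be taken from the training set only and then reused for the test set, so that no information leaks from test data into the model.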
V. MOTIVATION
Today, social networking sites involve billions of users around the world.
VI. OBJECTIVE OF THE SYSTEM
VII. METHODOLOGY
In the proposed system, we formulate the problem as a classification task to identify and mitigate the side effects of public shaming on social networks.
Two major contributions:
The goal is to automatically classify tweets into 8 categories. For each category, the labelled training and test sets undergo pre-processing and feature extraction. The training set is used to train the random forest (RF). Tweets marked as negative by all classifiers are not considered shameful.
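The decision rule described above, one binary classifier per category with a tweet treated as non-shaming only when every classifier says no, can be sketched as follows. The keyword-based predictors are hypothetical stand-ins for the trained random forests, and the category names are placeholders, since the eight categories are not listed here.

```python
from typing import Callable, Dict, List

Classifier = Callable[[str], bool]


def categorize(tweet: str, classifiers: Dict[str, Classifier]) -> List[str]:
    """Return every category whose binary classifier flags the tweet."""
    return [name for name, clf in classifiers.items() if clf(tweet)]


def is_shaming(tweet: str, classifiers: Dict[str, Classifier]) -> bool:
    """A tweet is non-shaming only if all classifiers mark it negative."""
    return bool(categorize(tweet, classifiers))


def keyword_classifier(keywords) -> Classifier:
    """Hypothetical stand-in for a trained per-category model."""
    def predict(tweet: str) -> bool:
        words = set(tweet.lower().split())
        return any(k in words for k in keywords)
    return predict


# Placeholder categories; a real system would hold one trained model per category.
classifiers = {
    "category_a": keyword_classifier({"idiot"}),
    "category_b": keyword_classifier({"hurt"}),
}

print(categorize("you idiot", classifiers))
print(is_shaming("nice day", classifiers))
```

Because each category has its own classifier, a single tweet can be assigned to more than one category.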
VIII. DATASET
The dataset contains 159,571 comments from Wikipedia. The data consists of one input feature, the string data for the comments, and five labels for different categories of toxic comments: toxic, obscene, threat, insult, and identity hate.
The figure on the following page contains a breakdown of how the labels are distributed throughout the dataset.
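A label breakdown of this kind can be computed by counting positive labels per category. The sketch below uses the five label names given above, with `identity_hate` written as an identifier; the rows are invented toy data for illustration, not values from the dataset, and the function name `label_distribution` is hypothetical.

```python
from collections import Counter

LABELS = ["toxic", "obscene", "threat", "insult", "identity_hate"]


def label_distribution(rows):
    """Return {label: (positive_count, fraction_of_rows)} for each label."""
    counts = Counter({label: 0 for label in LABELS})
    for row in rows:
        for label in LABELS:
            counts[label] += row.get(label, 0)
    total = len(rows)
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }


# Toy rows (made up): each dict marks the labels that apply to one comment.
rows = [{"toxic": 1, "insult": 1}, {"toxic": 1}, {}, {"threat": 1}]
for label, (count, share) in label_distribution(rows).items():
    print(f"{label}: {count} ({share:.0%})")
```

Since a comment can carry several labels or none, the per-label fractions need not sum to one.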
XIII. FUTURE SCOPE
XIV. ACKNOWLEDGMENT
We take this occasion to thank God Almighty for blessing us with His grace and bringing our endeavor to a successful culmination. We extend our sincere and heartfelt thanks to our esteemed guide, Dr. Nisha Auti, and the industry people, for providing us with the right guidance and advice at the crucial junctures and for showing us the right way.
Above all, we thank the Almighty, the source of all knowledge, understanding and wisdom.
After identifying the primary challenges, the multi-class automated hate speech categorization problem for text is solved with significantly better results. We presented a potential solution for countering the menace of online public shaming on Twitter by categorizing shaming comments into eight types, choosing appropriate features, and designing a set of classifiers to detect them. The propagation of hate speech on social media has increased significantly in recent years, and it is recognized that effective countermeasures rely on automated data mining techniques. Our work made several contributions to this problem. First, we introduced a method for automatically classifying hate speech on Twitter using machine learning that empirically improves classification accuracy.
Copyright © 2023 Dr. Nisha Auti, Sumit Ranaware, Shreeraj Ghadge, Rajdatta Jadhav, Prajwal Jagtap. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET52206
Publish Date : 2023-05-13
ISSN : 2321-9653
Publisher Name : IJRASET